feat(ISV-7020): add chunking capabilities and cluster-error handling by BorekZnovustvoritel · Pull Request #17 · release-engineering/Retriable-Kafka-Client

BorekZnovustvoritel · 2026-03-26T10:46:48Z

This PR introduces a custom message chunking protocol. This protocol is enabled by default in retry topics only. The work required creating a new data object, MessageGroup, which holds multiple messages, the value of these messages need to be concatenated to get the transferred data.

This new feature is unit and integration tested. Consumers can now consume both normal and chunked messages seamlessly, even from the same partition, the changes were made to be backwards-compatible.

This feature cannot be turned on for producers consumed by different libraries as their clients cannot read such messages, the implemented solution is custom.

* chunking enable only in retry topics * tests assisted with Claude Signed-off-by: Marek Szymutko <mszymutk@redhat.com> Assisted-by: Claude-4.6-Opus

BorekZnovustvoritel · 2026-04-02T12:35:49Z

+        ],
+        group_id="test-chunks",
+        timeout=30.0,
+        split_messages=True,


You can try locally that unsetting this will break the test because the message cannot fit in the topic. Which means:

Chunking fixes the issue

Chunking is off by default

Also check in integration logs for debug messages about reassembled MessageGroups.

BorekZnovustvoritel · 2026-04-02T12:38:15Z

@RichardPlesnik please take a look. This will be used for retry topics only, but requires no further configuration in any cluster or consumer services. Producer services not consumed by this library will still need to be updated with additional configuration.

bclindner

Very impressive work! Though I feel like Kafka should really have something for handling this in its wire protocol.

Some concerns forthcoming - sorry, hit early send on this review

bclindner · 2026-04-02T13:50:38Z

Whoops, that review wasn't finished, stand by...

bclindner

Implementation looks generally sound, but I'm worried about an edge-case here. Consider a situation where a producer sends all but one chunk of a message before its process is killed, for whatever reason, and the broker and consumer remain online.

Tiny problem I see here is ChunkingCache lacks any sort of expiry or method of discarding those chunks, or even a way to know when to discard them, right? I think they'd just be stuck in memory until the consumer process is restarted.
Way bigger problem is the commit offset - if I'm reading this right, if we lose any chunk of a message, wouldn't that make us unable to commit higher offsets, basically stalling the partition out?

If we have a way to address this maybe we could get some tests & ITs in to prove the edge case has a clean resolution in CI.

BorekZnovustvoritel · 2026-04-07T07:44:25Z

Thanks for raising the concern. You are right that there is currently no invalidation of the chunking cache other than successful read. However the behavior is not quite as you stated.

The messageGroup is submitted to the tracking manager only when the whole group is successfully reassembled. This means any incomplete messageGroup won't block anything as the trackingManager doesn't have information about it. This logic won't break as long as the producer sends the messageGroup to the cluster in a sequence without any interruptions.

The only scenario I can see this breaking is if the producer produces the start of the messageGroup, then some standalone message and then the rest of the group. If the consumer crashes right after the consumption of the standalone message and commits it before consuming the rest of the group, the first part of the group is lost. This could be fixed by merging chunkingCache and trackingManager to keep track of incomplete messages too.

There could also be some other incomplete messageGroups, either when the producer crashes during sending the group or if the data retention policy in the cluster destroys the first message(s) of the group. Currently, the only way of clearing them would be restarting the consumer application. I agree that some invalidation (with logged warnings) can be added to make this more robust. I will take another look at this, hopefully I will be able to address all the edgecases we currently see.

* enable chunk expiration * tests assisted with Claude Signed-off-by: Marek Szymutko <mszymutk@redhat.com> Assisted-by: Claude-4.6-Opus

bclindner

I think your explanation + changes adequately address things, thanks and sorry for the extra work! Will need to watch out for those warnings and make sure dropped messages can be addressed anywhere this chunking has to happen (though I assume in any decent EDA you'd have contingencies for that kind of thing).

RichardPlesnik · 2026-04-09T09:26:00Z

Thanks for the great work! LGTM 👍

BorekZnovustvoritel force-pushed the chunking_wip branch 2 times, most recently from 4ca7357 to b245377 Compare April 2, 2026 11:57

BorekZnovustvoritel marked this pull request as ready for review April 2, 2026 11:57

BorekZnovustvoritel changed the title ~~draft: add chunking capabilities~~ feat(ISV-7020): add chunking capabilities Apr 2, 2026

BorekZnovustvoritel force-pushed the chunking_wip branch from b245377 to d51e1fd Compare April 2, 2026 12:05

feat(ISV-7020): add chunking capabilities

32ffcfe

* chunking enable only in retry topics * tests assisted with Claude Signed-off-by: Marek Szymutko <mszymutk@redhat.com> Assisted-by: Claude-4.6-Opus

BorekZnovustvoritel force-pushed the chunking_wip branch from d51e1fd to 32ffcfe Compare April 2, 2026 12:26

BorekZnovustvoritel commented Apr 2, 2026

View reviewed changes

BorekZnovustvoritel requested a review from bclindner April 2, 2026 12:37

bclindner reviewed Apr 2, 2026

View reviewed changes

Comment thread src/retriable_kafka_client/chunking.py

bclindner reviewed Apr 2, 2026

View reviewed changes

Comment thread src/retriable_kafka_client/headers.py

bclindner reviewed Apr 2, 2026

View reviewed changes

Comment thread src/retriable_kafka_client/headers.py

Comment thread src/retriable_kafka_client/producer.py

Comment thread tests/unit/test_consumer.py

BorekZnovustvoritel and others added 3 commits April 7, 2026 14:43

feat(ISV-7020): enhance chunking capabilities

6922de3

* enable chunk expiration * tests assisted with Claude Signed-off-by: Marek Szymutko <mszymutk@redhat.com> Assisted-by: Claude-4.6-Opus

fix: prepare chunks with separate function to avoid code duplication

9437a14

fix(ISV-7020): fix formatting and address a review comment

3015405

bclindner approved these changes Apr 7, 2026

View reviewed changes

feat(ISV-7020): add cluster-error handling

e52da65

BorekZnovustvoritel changed the title ~~feat(ISV-7020): add chunking capabilities~~ feat(ISV-7020): add chunking capabilities and cluster-error handling Apr 8, 2026

bclindner approved these changes Apr 8, 2026

View reviewed changes

BorekZnovustvoritel merged commit ee3c975 into main Apr 9, 2026
5 checks passed

github-actions bot mentioned this pull request Apr 9, 2026

chore(main): release 0.6.0 #20

Merged

BorekZnovustvoritel deleted the chunking_wip branch April 9, 2026 13:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ISV-7020): add chunking capabilities and cluster-error handling#17

feat(ISV-7020): add chunking capabilities and cluster-error handling#17
BorekZnovustvoritel merged 5 commits intomainfrom
chunking_wip

BorekZnovustvoritel commented Mar 26, 2026 •

edited

Loading

Uh oh!

BorekZnovustvoritel Apr 2, 2026 •

edited

Loading

Uh oh!

BorekZnovustvoritel commented Apr 2, 2026

Uh oh!

bclindner left a comment •

edited

Loading

Uh oh!

Uh oh!

bclindner commented Apr 2, 2026

Uh oh!

Uh oh!

bclindner left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

BorekZnovustvoritel commented Apr 7, 2026

Uh oh!

bclindner left a comment •

edited

Loading

Uh oh!

RichardPlesnik commented Apr 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

BorekZnovustvoritel commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BorekZnovustvoritel Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

BorekZnovustvoritel commented Apr 2, 2026

Uh oh!

bclindner left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

bclindner commented Apr 2, 2026

Uh oh!

Uh oh!

bclindner left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

BorekZnovustvoritel commented Apr 7, 2026

Uh oh!

bclindner left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

RichardPlesnik commented Apr 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

BorekZnovustvoritel commented Mar 26, 2026 •

edited

Loading

BorekZnovustvoritel Apr 2, 2026 •

edited

Loading

bclindner left a comment •

edited

Loading

bclindner left a comment •

edited

Loading